\section{Picking $\epsilon$}
\label{sec:pickingeps}

% Pose the question

So far, we have discussed how to bound the error in the KS-statistic in terms of the error ($\epsilon$) of the quantile $\epsilon$-sketch. The obvious question that arises (and that we address in this section) is:
\vspace{.15cm}
\begin{quote}
What types of error bounds are necessary for computing the KS-statistic in practice? 
\end{quote}
\vspace{.15cm}


% Specify what the goal is more formally

Recall that the one-sample KS test computes the KS-statistic $D$ and then finds the significance level $\alpha$ at which $\sqrt{n}D > K_{\alpha}$. For instance, if the desired significance level is $\alpha = 0.05$, then we reject the hypothesis that the data is drawn from the distribution being tested against exactly when $\sqrt{n}D > K_{0.05} \approx 1.358$. If this value were instead to exceed $K_{0.01} \approx 1.628$, then we could reject the hypothesis at the $0.01$ significance level. Our goal, then, is to select the absolute error of the quantile $\epsilon$-sketch to be low enough that it does not adversely affect this comparison---that is, low enough that we neither reject nor fail to reject the null hypothesis erroneously.
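
For concreteness, this decision rule amounts to only a few lines of code. The following minimal sketch (the function name is ours) obtains the critical value $K_{\alpha}$ from the inverse Kolmogorov survival function, here via \texttt{scipy.special.kolmogi}; any table of $K_{\alpha}$ values would serve equally well:

\begin{verbatim}
# Minimal sketch of the one-sample KS decision rule; D is the
# (approximate) KS-statistic and n is the stream length.
from scipy.special import kolmogi

def ks_reject(D, n, alpha=0.05):
    K_alpha = kolmogi(alpha)  # e.g., kolmogi(0.05) ~= 1.358
    return (n ** 0.5) * D > K_alpha
\end{verbatim}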

% Talk about how low the error should be (0.1 should be pretty good)

Suppose that the KS test is being applied to reject the null hypothesis at the $\alpha = 0.05$ significance level. Then, the test simply needs to check whether $\sqrt{n}D > K_{0.05} \approx 1.358$. The error introduced by using our sketch, rather than all of the data in the stream, should be small enough not to degrade the quality of the test. One natural way to formalize this is to require that the approximate test always rejects the hypothesis whenever the exact statistic would reject it at the $\alpha = 0.04$ (or lower) significance level, and never rejects it when the exact statistic would fail to reject it at the $\alpha = 0.06$ (or higher) significance level. Since $K_{0.04} \approx 1.399$ and $K_{0.06} \approx 1.324$, an error in $\sqrt{n}D$ of up to $0.03$ can be tolerated in this example.
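
The tolerable slack is simply the smaller of the two gaps around $K_{0.05}$. A quick check (again assuming \texttt{scipy} is available) confirms that this gap is about $0.034$, so an error budget of $0.03$ is safe:

\begin{verbatim}
# Back-of-envelope check of the tolerable error in sqrt(n) * D
# when testing at alpha = 0.05 with guard levels 0.04 and 0.06.
from scipy.special import kolmogi

slack = min(kolmogi(0.04) - kolmogi(0.05),  # ~= 1.399 - 1.358
            kolmogi(0.05) - kolmogi(0.06))  # ~= 1.358 - 1.324
print(round(slack, 3))                      # 0.034
\end{verbatim}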

% Solve for desired memory requirement in one-sample test

Let us say that our goal is to compute the quantity $\sqrt{n}D$ to within $\delta$ precision. To achieve this level of accuracy, we must determine how much space the quantile $\epsilon$-sketch requires. For the purposes of this analysis, we use the Greenwald-Khanna sketch as an example, since it is considered the state of the art for computing quantiles on a stream. The Greenwald-Khanna sketch needs $O(\frac{1}{\epsilon}\log{(\epsilon n)})$ space to guarantee an error of $\epsilon$. Now, recall from Section~\ref{sec:onesample} that an error of $\epsilon$ in the sketch translates to an error of $3\epsilon$ in our estimate of $D$. Hence, we need to pick $\epsilon$ such that $3\epsilon\sqrt{n} \leq \delta$. Choosing $\epsilon = \delta/(3\sqrt{n})$ suffices, so we substitute this into the space requirement of the sketch to get that $O(\frac{\sqrt{n}}{\delta}\log{(\delta\sqrt{n})})$ space is required, which is $O(\sqrt{n}\log{n})$ for constant $\delta$ (e.g., $0.03$ as in the example above). Thus, the overall space needed by the one-sample test is $O(\sqrt{n}\log{n})$. For example, a terabyte of data would be compressed down to the order of megabytes.
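
As a sanity check on these magnitudes, the following back-of-envelope calculation walks through the substitution for a stream of $10^{11}$ values; note that all constants hidden by the $O(\cdot)$ notation (including the $1/\delta$ factor) are dropped, so the output is only a rough estimate:

\begin{verbatim}
# Purely asymptotic arithmetic; constants hidden by the O(.) are
# implementation-dependent and omitted here.
import math

n = 10 ** 11                         # ~1 TB of 8-byte values
delta = 0.03                         # tolerated error in sqrt(n) * D
eps = delta / (3 * math.sqrt(n))     # required sketch accuracy
sketch = math.sqrt(n) * math.log(n)  # O(sqrt(n) log n) items
print(f"eps ~= {eps:.1e}")                       # ~= 3.2e-08
print(f"sketch ~= {sketch:.1e} of n = {n:.0e}")  # ~= 8.0e+06 of 1e+11
\end{verbatim}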

% Similar result for two-sample

The result for the two-sample test is similar. Recall that the two-sample test allows one to reject the hypothesis at the $\alpha$ significance level when $\sqrt{\frac{nm}{n + m}}D > K_{\alpha}$, where $n$ and $m$ are the lengths of the two streams. Assume, without loss of generality, that $n \geq m$. We then have that $\sqrt{\frac{nm}{n + m}} \leq \sqrt{\frac{n^2}{n}} = \sqrt{n}$, and an analysis similar to the one above shows that $O(\sqrt{n}\log{n})$ space suffices for the quantile sketches to reliably perform this test. Unfortunately, this space requirement applies to both the $n$-length and the $m$-length stream, which means that when $m \ll n$, the space required for the smaller stream may be comparable to---or even exceed---the length of that stream itself, defeating the purpose of sketching. Hence, the two-sample test is only feasible when $n$ and $m$ are not too far apart in magnitude, as we demonstrate in the experimental section.
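
A short calculation makes the imbalance problem concrete. The helper below (the function name, and the reuse of the $3\epsilon$ amplification from the one-sample analysis, are our assumptions) uses the exact factor $\sqrt{nm/(n+m)}$ rather than the conservative $\sqrt{n}$ bound; since any quantile $\epsilon$-sketch must retain at least $1/\epsilon$ samples, comparing $1/\epsilon$ to $m$ shows when sketching the smaller stream stops paying off:

\begin{verbatim}
# Accuracy forced on both sketches by the two-sample test; the 3x
# error amplification is carried over from the one-sample analysis.
import math

def required_eps(n, m, delta=0.03):
    eff = math.sqrt(n * m / (n + m))  # effective sample size
    return delta / (3 * eff)

for n, m in [(10**9, 10**9), (10**9, 10**4)]:
    eps = required_eps(n, m)
    # Any eps-sketch keeps >= 1/eps samples; compare that to m.
    print(f"m = {m:.0e}: eps = {eps:.1e}, 1/eps = {1/eps:.1e}")
\end{verbatim}

In the balanced case the sketches need on the order of $10^{6}$ samples out of $10^{9}$, whereas in the imbalanced case $1/\epsilon$ matches $m$ itself, i.e., the smaller stream would have to be stored essentially in full.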

% Compare with random sampling

We next compare our $O(\sqrt{n}\log{n})$ result with that for random sampling. There is a folklore result that random sampling needs $\Theta(1/\epsilon^2)$ samples to $\epsilon$-approximate the quantiles~\cite{BS09}. Substituting the requirement $\epsilon \leq \delta/(3\sqrt{n})$ from above into this bound shows that $\Omega(n)$ samples are needed. Hence, we expect random sampling to require far more samples than the Greenwald-Khanna sketch to attain the same level of accuracy.
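
The gap is easy to see numerically. Under the same $\epsilon = \delta/(3\sqrt{n})$ requirement (and again with all hidden constants dropped), the two sample counts compare as follows:

\begin{verbatim}
# Samples needed for eps-approximate quantiles, constants dropped:
# Theta(1/eps^2) for random sampling vs. O((1/eps) log(eps n))
# for the Greenwald-Khanna sketch.
import math

n = 10 ** 9
eps = 0.03 / (3 * math.sqrt(n))
sampling = 1 / eps ** 2
gk = (1 / eps) * math.log(eps * n)
print(f"sampling ~= {sampling:.1e}, GK ~= {gk:.1e}")
# sampling ~= 1.0e+13, GK ~= 1.8e+07
\end{verbatim}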

% Talk about limitation of this approach as at least 1/\eps samples needed

The analysis above also reveals a limitation of the techniques introduced in this paper. There is a trivial lower bound of $\Omega(\frac{1}{\epsilon})$ on the number of samples needed by any quantile $\epsilon$-sketch. As a result, because of the $\epsilon \leq \delta/(3\sqrt{n})$ requirement, even with further improvements in the development of these sketches, we cannot expect the algorithms in this paper to use $o(\sqrt{n})$ space to reliably perform the KS test. We leave as an open question whether other algorithms can perform KS tests in a stream using $o(\sqrt{n}\log{n})$ space.
